feat(compiler+recorder): contenteditable typing capture, trace_viewer view fixes, SVG-clickable highlight, and dual default-alias config by softpudding · Pull Request #66 · softpudding/OpenBrowser

softpudding · 2026-04-25T02:06:09Z

Summary

This branch lands several related changes uncovered while debugging a real
recording where text typed into Yuque's contenteditable document body was
silently dropped between recorder and compiler. Fixing that one bug surfaced
gaps at every layer of the recording → compile → replay pipeline, which this
branch addresses end-to-end. It also brings forward a few earlier
extension/highlight fixes and a server-side LLM-config split that were
already on the branch.

The 8 commits group into four themes:

1. Recorder + compiler: capture and surface contenteditable typing

Rich-text editors (Yuque's Lake editor, ProseMirror, Slate, Lexical, TipTap)
intercept keystrokes at keydown + preventDefault and apply edits via
their own DOM model, so native input events never fire on the body. The
old extension input listener also explicitly filtered to
HTMLInputElement/HTMLTextAreaElement only — so typing into a Yuque doc
produced zero input events in the trace. Even when surrounding
keydown Enter events captured an HTML snapshot, the compiler-side
trace_viewer truncated input values at 80 chars in the events view and
omitted them entirely from normalized_steps, so user instructions typed at
the end of a body (after URL paste) never reached the LLM.

Recorder (extension/src/content/index.ts, commit 8f11aa6):

- Broaden the input listener filter to also accept contenteditables.
- New beforeinput listener, contenteditable-only: captures keystrokes
  before the editor intercepts, with the DOM snapshot deferred to a
  microtask so the serialized value reflects post-mutation state.
- New isContentEditableElement() / getContentEditableText() helpers in
  serializeElement() populate value/valueLength/isContentEditable: true
  for contenteditable targets.
- coalesce_typing_events (fa4913b, a workflow_compiler helper) folds runs of
  consecutive typing events on the same element into one. event_detail keeps
  working on any absorbed index, but a 100-keystroke burst no longer buries
  every click in noise.

Compiler (server/core/compiler_agent.py + server/core/workflow_compiler.py):

- _format_value_with_tail(value, head=200, tail=200) renders long input
  values with both ends visible and a …(N more chars; use event_detail)…
  middle marker in the events view (replaces the hard 80-char head-only
  truncation).
- _handle_normalized_steps surfaces a per-anchor
  field <selector> final_value="…" line for [form] steps, picking the
  latest snapshot in event order (paste-then-trim, backspace-heavy edits,
  and clear-and-rewrite all break the longest-wins heuristic).
- _extract_input_value falls back to _extract_visible_text_from_html on
  element.html (then to element.text) when value is missing, which also
  recovers contenteditable text from older traces. Sensitive fields still
  bypass the fallback.
- system_prompt_compiler.j2 is updated in the agent-sdk repository (commit
  32e6edba there) and mirrored in the venv copy on this branch via the
  local_vendor cleanup. It now describes the new render markers and the
  final_value line semantics for contenteditable bodies.
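
As a rough illustration of the head-and-tail rendering described above, a minimal sketch might look like the following. The function name mirrors `_format_value_with_tail`, but the body is an assumption reconstructed from the description, not the code in `compiler_agent.py`; the marker text follows the form quoted in the commit message.

```python
def format_value_with_tail(value: str, head: int = 200, tail: int = 200) -> str:
    """Render a long value with both ends visible and the middle elided.

    Hypothetical sketch: short values are returned unchanged; long ones keep
    the first `head` and last `tail` characters around a count marker.
    """
    if len(value) <= head + tail:
        return value
    hidden = len(value) - head - tail
    # value[-0:] would return the whole string, so guard the tail == 0 case.
    tail_part = value[-tail:] if tail > 0 else ""
    return f"{value[:head]} ...({hidden} more chars; use event_detail)... {tail_part}"


if __name__ == "__main__":
    pasted_url = "https://example.com/" + "x" * 500
    typed_instruction = " Write also: 1. A brief intro"
    print(format_value_with_tail(pasted_url + typed_instruction, head=40, tail=40))
```

The point of keeping the tail is that instructions typed after a long paste sit at the end of the value, which is exactly what a head-only cut hides.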

Tests (new):

- server/tests/unit/test_workflow_compiler_contenteditable.py (8 tests):
  HTML fallback, malformed input, script/style skipping, sensitive-field
  refusal, plain-text fallback.
- server/tests/unit/test_compiler_agent_value_view.py (6 tests):
  _format_value_with_tail edges + _handle_normalized_steps form-step
  final-value rendering with paste-then-trim coverage.
- server/tests/unit/test_coalesce_typing_events.py (from fa4913b).
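
The HTML-fallback behaviour that the first test file exercises could be sketched roughly as below. This is an assumption-laden illustration built on the standard-library `html.parser`: the function names, the `sensitive` flag, and the dictionary keys are hypothetical stand-ins, and the real `_extract_input_value` may differ in how it detects and handles sensitive fields.

```python
from html.parser import HTMLParser
from typing import Optional


class _VisibleTextParser(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_visible_text(html: str) -> str:
    parser = _VisibleTextParser()
    parser.feed(html)
    parser.close()  # flush any buffered text from malformed input
    return " ".join(parser.chunks)


def extract_input_value(element: dict, sensitive: bool = False) -> Optional[str]:
    """Prefer `value`; otherwise fall back to html, then text (sketch only)."""
    value = element.get("value")
    if value:
        return value
    if sensitive:
        # Assumption: sensitive fields never go through the fallback.
        return None
    html = element.get("html")
    if html:
        text = extract_visible_text(html)
        if text:
            return text
    return element.get("text") or None
```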

2. Extension fixes already on the branch

- 83d27c0 fix(extension): type via CDP Input.dispatchKeyEvent per character
- 274c37a fix(highlight): detect SVG graphics elements as clickable

274c37a is the primary cause of the bidirectional movement seen in the eval
results below. It was needed for mapquest_nearby_pins (where <circle> pins were
invisible to the highlight scan) and yields +22 summed score across the four
models on that test alone, but it has a small rubric side effect on
bluebook_simple, where the agent now clicks the SVG like-heart on the search
card and skips the note_open rubric criterion.

3. Compiler-default-alias config + UI surface

- 224d025 feat: separate compiler-agent default LLM from general agent default.
  Touches server/core/llm_config.py, server/api/routes/config.py, the frontend
  surface, and tests. Lets the compiler use a stronger model than the runtime
  agent (e.g. plus for compile, flash for execute).
- 3ff942b chore: remove stray local_vendor/ directory.
  Removes a stale check-in of the agent-sdk system prompt that diverged from
  the upstream copy.

4. Eval scaffolding + reports

- eval/routine_eval/fixtures/github-trending-contenteditable-question/:
  new fixture pinning the regression. It contains intent_note.txt (1 line),
  raw_intention.md (history + ground truth), and expectations.yaml (requires a
  position-vs-identity question; forbids "what text did the user type"; the
  expected_routine_content block requires the routine to mention all three
  agent-investigation prompts).
- eval/evaluation_report.json: refreshed benchmark from the 2026-04-24
  full eval (105/140 PASSED, 75.0%).
- eval/routine_eval/evaluate_routine_compile.py: namespaces the
  canonical regression report by compile_alias so a multi-model loop
  produces compile_evaluation_report_<alias>.json per run instead of every
  run clobbering the same file.
- 4 per-model canonical compile-eval reports
  (qwen3{5,6}{plus,flash}-fast.json).
- skill/claude/ob-routines/SKILL.md: fixes for the recording skill that
  surfaced while running this branch end-to-end. The tmux launch keeps the
  window alive via exec zsh, the Monitor template now detects pane-gone,
  [compiler:saved] is verified via list_routines.py, and the gate
  reasoning is explicitly Claude's responsibility to write out as
  user-visible text.

Eval results (2026-04-24 full run, 4 × `-fast` models, 35 tests × 4 = 140 runs)

| Model | Pass rate | Score / Max | Δ score vs main |
| --- | --- | --- | --- |
| qwen3.5-plus | 85.7% | 274.3 / 304.8 | −1.9 |
| qwen3.5-flash | 60.0% | 232.2 / 304.8 | −10.9 |
| qwen3.6-plus | 74.3% | 251.4 / 304.8 | −11.0 |
| qwen3.6-flash | 80.0% | 274.5 / 304.8 | +1.5 |

Raw total delta: −21.8 / 1219.2 (−1.8%)
Infra-adjusted delta (subtracting two confirmed 0.0-score
400 Bad Request infra kills + one LLMBadRequestError mid-flow):
−4.8 / 1219.2 (−0.4%) — within stochastic range.
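
The percentages quoted above can be re-derived from the reported totals. The short check below only reuses figures stated in this description (per-model maximum 304.8, 35 tests per model, the listed pass rates, and the two reported deltas); the variable names are illustrative.

```python
MAX_PER_MODEL = 304.8
TESTS_PER_MODEL = 35
PASS_RATES = {  # percentages as reported in the table
    "qwen3.5-plus": 85.7,
    "qwen3.5-flash": 60.0,
    "qwen3.6-plus": 74.3,
    "qwen3.6-flash": 80.0,
}

total_max = MAX_PER_MODEL * len(PASS_RATES)      # summed max score, 4 models
total_runs = TESTS_PER_MODEL * len(PASS_RATES)   # 140 runs

# Recover per-model pass counts from the rounded percentages.
passed = {m: round(r / 100 * TESTS_PER_MODEL) for m, r in PASS_RATES.items()}
total_passed = sum(passed.values())


def pct_of_max(delta: float) -> float:
    """Express a score delta as a percentage of the summed max score."""
    return round(100 * delta / total_max, 1)


print(round(total_max, 1), total_passed, total_runs)
print(pct_of_max(-21.8), pct_of_max(-4.8))
```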

Bidirectional movement (the dominant pattern):

- 4 of the top-5 improvements are on mapquest_nearby_pins
  (+3.0 / +5.5 / +6.0 / +7.5 across models), exactly the test that
  motivated 274c37a.
- The largest regression is bluebook_simple on qwen3.5-flash (−2.0),
  a known rubric-coupled side effect of the same SVG-clickable change.
- Net 274c37a effect across the suite: ~+22 mapquest gains, ~−2 to ~−4
  bluebook costs. The trade-off is real but heavily positive.
- The please_help_me tool is observed as a soft killswitch in eval mode
  (gmail_vendor_escalation, two models). Recommend a future harness fix to
  auto-reject the call so the agent doesn't stall waiting for a human who
  never comes.

Full root-cause analysis with per-failure entries (F1–F25) lives in
tmp/observation_notes_20260424_100152.md and the rolled-up report at
tmp/OBSERVATION_REPORT_20260424_100152.md on this branch checkout (not
committed; the artifacts are reproducible from the eval command line in
the report header).

The compiler/recorder work itself does not show any regression on the
agent-loop eval — those changes only affect the compile path, not agent
execution.

Routine-compile eval (4 × `-fast` models, 3 fixtures × 4 = 12 runs)

Per-model canonical reports now in eval/routine_eval/. Pass rates:

| Model | Pass | Notes |
| --- | --- | --- |
| qwen3.5-plus | 1/3 | New fixture: `intent_match=1.0` (was 0.4 pre-fix), confirming that the trace_viewer changes reach the LLM. |
| qwen3.6-plus | 2/3 | Best of the four. |
| qwen3.5-flash | 0/3 | Consistent across runs: flash drift on multi-step tasks. |
| qwen3.6-flash | 1/3 | |

The new github-trending-contenteditable-question fixture is genuinely
hard — even when the model picks up the typed instructions correctly
(intent_match=1.0 on qwen3.5-plus), it can still fail on Keywords
placement or asking-behavior. That is by design: the fixture is a
multi-axis stress test of the contenteditable pipeline.

Dependency

agent-sdk: pinned to 32e6edba2178eac73afea6d0a3bdf452d621394a on the
open-browser branch. That commit contains the matching prompt update
(feat(compiler): surface long input values and form final_value in
trace_viewer). pyproject.toml and uv.lock are updated, and the lock matches
the pin.

Test plan

- uv run pytest -q: 499 passed, 4 skipped, 6 warnings
- npm --prefix extension test: 195 pass / 0 fail / 564 expect() calls
- Pre-commit (black + prettier + eslint + check-toml + check-yaml) on
  every file touched on this branch
- Real-recording end-to-end: recorded a live Yuque doc edit on this
  branch. The typed "Write also: 1. A brief intro 2. What's special 3. Why's
  it trending" is now a first-class chunk of the trace, and
  the compiler agent recognises it as agent-investigation prompts on
  its own without manual gate feedback (qwen3.5-plus run, see
  eval/routine_eval/fixtures/github-trending-contenteditable-question/).

🤖 Generated with Claude Code

The Compiler Agent was falling through to `default_llm_alias`, which is typically a small/cheap model (qwen3.5-flash) unsuitable for compiling recordings into routines. Introduce `default_compiler_alias` so operators can point the compiler at a stronger model (e.g. qwen3.5-plus) without changing the agent default. Empty/unset falls back to the agent default.

- server: add AppConfig.default_compiler_alias, get_compiler_llm_config(), set_default_compiler_alias(); route compiler_agent and /recordings compile pre-validation through the new resolver.
- api/config: surface and accept default_compiler_alias; validate against submitted aliases before persisting (avoids half-saved state on 400).
- skill/ob-routines: Claude queries /api/config and, if default_compiler_alias is unset, picks the best available alias (plus > flash, avoid coding endpoint's tighter quota) and passes it via --model-alias. Always reports the chosen model to the user.
- frontend: Compiler-default dropdown in the model settings panel ("— use agent default —" + one option per configured alias), synced as aliases are added/renamed/removed.
- tests: new test_llm_config_manager covers alias selection, fallback, and auto-reset when the configured alias is removed; route tests cover POST validation ordering and persistence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
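
The fallback rule in this commit message (empty or unset compiler alias resolves to the agent default) can be sketched as below. Only the field names `default_llm_alias` and `default_compiler_alias` come from the commit message; the dataclass, the `aliases` mapping, and the resolver function are hypothetical, and treating a removed alias as "fall back" is an assumption based on the auto-reset test it mentions.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class AppConfigSketch:
    """Illustrative stand-in for the server config, not the real AppConfig."""

    default_llm_alias: str
    default_compiler_alias: Optional[str] = None
    aliases: Dict[str, dict] = field(default_factory=dict)


def resolve_compiler_alias(cfg: AppConfigSketch) -> str:
    """Return the alias the compiler should use.

    An empty, unset, or no-longer-configured compiler alias falls back to
    the general agent default.
    """
    alias = (cfg.default_compiler_alias or "").strip()
    if alias and alias in cfg.aliases:
        return alias
    return cfg.default_llm_alias


if __name__ == "__main__":
    cfg = AppConfigSketch(
        default_llm_alias="flash",
        default_compiler_alias="plus",
        aliases={"flash": {}, "plus": {}},
    )
    print(resolve_compiler_alias(cfg))
```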

Sites like Zhihu rotate hot-topic text into the search input on a timer, pausing only when they see real keydown/keyup events. The old path set `el.value = text` and dispatched a synthetic `input` event; no keystroke events fired, so the rotator kept ticking and clobbered typed text during the LLM turn gap (the Zhihu-search-AI bug).

performKeyboardInput now:

- focuses and clears the editable via JS (keeps existing activation helpers for labels and shadow roots)
- types each character via Input.dispatchKeyEvent, so keydown, input, and keyup all flow through Chromium's native input pipeline. A US keyboard layout table covers letters, digits, space, and ~30 punctuation/shifted symbols with correct DOM code and virtual key; non-ASCII falls back to `char` insertion.
- verifies document.activeElement actually landed on (or inside) the target before handing off to CDP, failing loudly otherwise.
- re-runs validateCachedElement during readback so a rerendered replacement node surfaces as stale instead of phantom success.

Verified end-to-end against Zhihu (search "AI" now submits "AI") and DuckDuckGo with punctuation ("C++ @2026.04 vs. Rust?" round-trips through the URL unchanged). Also re-ran the full 4-model eval (140 runs): qwen3.6-flash gained +10.8 task points and ran ~30s faster per test, consistent with no longer losing time to rotator clobber.
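
To make the per-character typing concrete, here is a small sketch that builds the parameter dictionaries one might pass to CDP's `Input.dispatchKeyEvent`. The extension itself is TypeScript; this Python version only illustrates the shape of the data. The parameter names (`type`, `key`, `code`, `windowsVirtualKeyCode`, `modifiers`, `text`) are real CDP fields, but the layout table is a tiny hypothetical subset and the helper names are invented for this example. The actual implementation may also send separate Shift key events.

```python
import string

SHIFT = 8  # CDP modifier bit for Shift (Alt=1, Ctrl=2, Meta=4, Shift=8)


def build_us_layout() -> dict:
    """Map characters to (DOM code, Windows virtual key, needs_shift)."""
    layout = {" ": ("Space", 32, False)}
    for c in string.ascii_lowercase:
        entry = (f"Key{c.upper()}", ord(c.upper()))
        layout[c] = entry + (False,)
        layout[c.upper()] = entry + (True,)
    for d in string.digits:
        layout[d] = (f"Digit{d}", ord(d), False)
    # A few punctuation keys; the real table is described as covering ~30.
    layout["+"] = ("Equal", 187, True)
    layout["."] = ("Period", 190, False)
    layout["?"] = ("Slash", 191, True)
    return layout


US_LAYOUT = build_us_layout()


def key_events_for_text(text: str) -> list:
    """Return the dispatchKeyEvent params needed to type `text`."""
    events = []
    for ch in text:
        entry = US_LAYOUT.get(ch)
        if entry is None:
            # Characters outside the table fall back to `char` insertion.
            events.append({"type": "char", "text": ch})
            continue
        code, vk, shifted = entry
        base = {
            "key": ch,
            "code": code,
            "windowsVirtualKeyCode": vk,
            "modifiers": SHIFT if shifted else 0,
        }
        events.append({**base, "type": "keyDown", "text": ch})
        events.append({**base, "type": "keyUp"})
    return events
```

Because each character yields a real keyDown/keyUp pair, a page that pauses its rotator on keystroke events sees the same signals it would get from a human typist.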

Map pins (SVG <circle>/<rect> children of <g> inside <svg>), icon toggles drawn directly in SVG, and chart markers can have their own cursor:pointer and click listener without an HTML wrapper. The prior detection pipeline dropped them on two gates:

1. isMeaningfulPointerCandidate rejected non-HTMLElements outright, so the pointer-cursor signal never registered for SVG leaves.
2. resolveClickableCandidate walked from an SVG element to its parentElement and bailed when that parent was also an SVG (as it always is for pins inside <g>-wrappers).

Fix:

- Accept SVGElement in isMeaningfulPointerCandidate; the size/area heuristics below already work on SVG bboxes (the SVGGraphicsElement.prototype.getBoundingClientRect patch at the top of the scan makes layout reads fast and consistent).
- When resolveClickableCandidate is handed an SVG graphics element that classifies as clickable on its own, return it directly as a standalone candidate. Fall through to the existing HTML-ancestor walk only for decorative SVG children of interactive HTML wrappers (<button><svg>…</svg></button>); behaviour is unchanged for that case.

Surfaced by the mapquest_nearby_pins evaluation test: pins rendered as <circle class="map-pin"> with cursor:pointer + addEventListener click were never picked up in the highlight scan, so the agent had no element_id to target. All 4 qwen models scored 4.5/12 on that test both on main and on branch, a stable 4.5-point floor that the pin-detection gap explained.

Verified end-to-end via open-browser skill against the mapquest mock site (served with the shared /js/tracker.js dependency so the page actually renders pins): 8 pins detected, agent clicks the Space Needle pin and the place-detail panel opens with name, rating, hours, address, website, phone. Existing highlight-detection / highlight-any / element-actions-regression suites all green.

The single file under local_vendor/openhands-sdk/ was unreferenced by pyproject.toml, uv.lock, or any source. The SDK is consumed via the uv git source, not a local vendor tree.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…inal keyframe on stop

Two independent wins for the compiler-agent's trace view, both surfaced by recording aab8711b on the latest session:

1. Merge consecutive keystrokes. The recorder emits one `input` event per keystroke (plus `beforeinput` on contenteditable rich editors), so a 10-letter title produces 10+ near-identical events. On a real Yuque recording, 123 of the 266 events in the trace were input events, so every actual action (click, navigation, drag) was buried in typing noise.

   New helper `coalesce_typing_events` in workflow_compiler walks events in order; runs of consecutive `input`/`change`/`beforeinput` on the same element identity collapse into the last event in the run. The survivor carries the final text snapshot and picks up `coalescedCount` + `coalescedEventIndexes` annotations so the agent can drill back into any single keystroke via `event_detail`. Identity uses a new `_stable_element_identity` (selector + ARIA label + placeholder + container selector). The existing `_element_identity` folds `element.text` into the hash, which is exactly what changes between keystrokes and defeats coalescing.

   `TraceViewerExecutor` now applies the coalescer in its constructor and presents the folded list via `events`/`summary`; `_events_by_index` still indexes the full raw list so `event_detail` works on absorbed indexes. `summary` calls out the raw→coalesced delta and each listed event gets a `[coalesced ×N]` tag when more than one was folded.

   Smoke-tested against recording 589eb0e8 (266 events, 123 inputs): after coalescing, 7 input events remain (one per typing burst on one element); non-typing events are untouched.

2. Capture a final keyframe on `recording_stopped`. `stopRecording` picks the scope's currently-active tab (falling back to any recordable tab in scope) and calls `buildRecordingKeyframe` before the debugger session is torn down. The keyframe rides on the `recording_stopped` event's `event_data.keyframe` slot, so the existing trace viewer keyframe-count + `event_detail` image display works without further changes. Failures are logged and swallowed: stop must never block on screenshot flakiness.

Tests: new `test_coalesce_typing_events.py` (8 cases covering folding, run separation, keyframe promotion, order preservation); existing `test_workflow_compiler_contenteditable.py` still green; recorder bun-test suite still passes with the new stop-time keyframe call (gracefully no-ops when Chrome debugger API isn't available in the harness).
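
The folding rule described in item 1 can be sketched as follows. This is an illustrative reconstruction, not the helper in `workflow_compiler`: the event shape (`type`, `element`, and the element keys used for identity) is assumed, and keyframe promotion, which the tests mention, is left out.

```python
TYPING_TYPES = {"input", "change", "beforeinput"}


def stable_element_identity(event: dict) -> tuple:
    """Identity that ignores the element's text, which changes per keystroke.

    The key names below are hypothetical stand-ins for the recorded fields.
    """
    el = event.get("element") or {}
    return (
        el.get("selector"),
        el.get("ariaLabel"),
        el.get("placeholder"),
        el.get("containerSelector"),
    )


def coalesce_typing_events(events: list) -> list:
    """Collapse each run of consecutive typing events on one element
    into the last event of the run, annotated with what it absorbed."""
    out = []
    run = []  # (raw_index, event) pairs for the current typing run

    def flush() -> None:
        if not run:
            return
        survivor = dict(run[-1][1])  # last event carries the final snapshot
        if len(run) > 1:
            survivor["coalescedCount"] = len(run)
            survivor["coalescedEventIndexes"] = [i for i, _ in run]
        out.append(survivor)
        run.clear()

    for index, event in enumerate(events):
        if event.get("type") in TYPING_TYPES:
            if run and stable_element_identity(run[-1][1]) != stable_element_identity(event):
                flush()  # typing moved to a different element
            run.append((index, event))
        else:
            flush()  # any non-typing event ends the run
            out.append(event)
    flush()
    return out
```

Keeping the raw indexes on the survivor is what lets a viewer still resolve any absorbed keystroke by its original position.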

…t in trace_viewer

Three layered fixes so that text typed into rich-text editor bodies (Yuque/Lake editor and similar) reaches the compiler agent, plus an eval fixture pinning the regression and SKILL.md improvements that surfaced from running the recording→compile pipeline end-to-end.

Recorder (extension/src/content/index.ts):

- input listener now also matches contenteditable targets, with a new isContentEditableElement helper and a getContentEditableText helper that populates the serialized value from innerText.
- New beforeinput listener for contenteditable targets only. It covers rich editors (Lake, ProseMirror, Slate, Lexical, TipTap) that intercept keydown + preventDefault and synthesize edits via their own DOM model so native input events never fire on the body. The DOM snapshot is deferred to a microtask so it reflects the post-mutation state.

Compiler view (server/core/compiler_agent.py):

- _format_value_with_tail renders long input values with both ends visible and the middle elided as "<head> ...(N more chars; use event_detail)... <tail>". Replaces the hard 80-char head-only truncation that hid user-typed instructions appearing late in the value.
- _handle_normalized_steps now surfaces a "field <selector> final_value=..." line per anchor for [form] steps, picking the latest snapshot in event order. The previous summary showed only step type and event indexes, hiding the actual typed content.

Eval fixture (eval/routine_eval/fixtures/github-trending-contenteditable-question/):

- Real recording (5c5cf4f5) where the user types instructions into the Yuque body for the replay agent to follow. expectations.yaml encodes the position-vs-identity ambiguity and forbids asking the user to retype visible content while leaving intent-clarification questions legitimate.

Tests (server/tests/unit/):

- test_workflow_compiler_contenteditable.py covers the html-fallback path in _extract_input_value (introduced earlier in this branch).
- test_compiler_agent_value_view.py covers _format_value_with_tail edge cases and the latest-in-event-order field picker, including paste-then-trim / clear-and-rewrite flows where longest-wins would surface stale text.

SKILL.md (skill/claude/ob-routines/SKILL.md):

- tmux launch keeps the window alive via "exec zsh" so [compiler:saved] and [compile-done] markers don't get lost when the window auto-closes on python exit.
- Monitor template detects pane-gone and emits a terminal event so a silent dead-pane poll loop is no longer possible.
- Adds a verify-after-saved step using list_routines.py.
- Quality-gate section now states explicitly that the gate reasoning is Claude's judgment, must be written as user-visible text before pressing Enter, and the compiler's wrap-up message is not a substitute.

Pin agent-sdk to commit 32e6edba (matching prompt update for the new trace_viewer rendering).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
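
The difference between latest-in-event-order and longest-wins is easiest to see on a paste-then-trim flow. The sketch below is a hypothetical illustration of the picker idea only; the event shape and function names are assumptions, and the output line merely imitates the `field <selector> final_value=...` format quoted above.

```python
def pick_final_values(events: list) -> dict:
    """For each field selector, keep the value from the latest event.

    Iterating in event order and overwriting means the last snapshot wins,
    even when an earlier snapshot was longer. An empty string is a valid
    final value (clear-and-rewrite flows pass through it).
    """
    latest = {}
    for event in events:
        element = event.get("element") or {}
        selector = element.get("selector")
        value = element.get("value")
        if selector is None or value is None:
            continue
        latest[selector] = value
    return latest


def render_final_value_lines(events: list) -> list:
    return [
        f'field {selector} final_value="{value}"'
        for selector, value in pick_final_values(events).items()
    ]


if __name__ == "__main__":
    trace = [
        {"element": {"selector": "#url", "value": "https://example.com/a?utm_source=x"}},
        {"element": {"selector": "#url", "value": "https://example.com/a"}},
    ]
    # A longest-wins picker would surface the stale, untrimmed URL here.
    print("\n".join(render_final_value_lines(trace)))
```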

Full eval on feat/compiler-default-alias with qwen3.{5,6}-{plus,flash}-fast. 105/140 passed (75.0%), raw score delta −21.8 vs main (−1.8%); infra-adjusted −4.8 (−0.4%). See tmp/OBSERVATION_REPORT_20260424_100152.md for full root-cause analysis (T1 SVG-clickable trade-off, T2 please_help_me eval killswitch, flash instruction drift, etc.).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…alias

The canonical version-controlled report at `eval/routine_eval/compile_evaluation_report.json` is overwritten by every run, so a multi-model eval loop (4 -fast models) ends up keeping only the last run's data, which happened to be the weakest model and looked like "all cases failed".

When `--compile-alias` is given, write the canonical copy to `compile_evaluation_report_<alias>.json` instead so each model in a loop preserves its own baseline. The unsuffixed path is preserved as the default for runs that use the server's default alias, so existing dashboards and CI flows are unaffected.

Also commits the four per-model reports from a fresh rerun on 2026-04-24:

- qwen35plus-fast: 1/3 pass (intent_match peaks at 1.0 on the new contenteditable fixture, confirming the trace_viewer fixes from 8f11aa6 reach the LLM)
- qwen36plus-fast: 2/3 pass
- qwen35flash-fast: 0/3 pass (consistent across runs; flash drift)
- qwen36flash-fast: 1/3 pass

Per-test failure breakdown is in tmp/observation_notes_20260424_100152.md plus the chat record from the rerun.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
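
The path selection described in this commit could be sketched like this. The two file names come from the commit message; the function name and signature are hypothetical, and the real script may build the path differently.

```python
from pathlib import Path
from typing import Optional


def canonical_report_path(report_dir: Path, compile_alias: Optional[str] = None) -> Path:
    """Pick where the canonical compile-eval report is written.

    With an alias, each model gets its own file, so a multi-model loop no
    longer overwrites earlier results. Without one, the unsuffixed path is
    kept so existing consumers keep working.
    """
    if compile_alias:
        return report_dir / f"compile_evaluation_report_{compile_alias}.json"
    return report_dir / "compile_evaluation_report.json"


if __name__ == "__main__":
    base = Path("eval/routine_eval")
    for alias in (None, "qwen35plus-fast", "qwen36flash-fast"):
        print(canonical_report_path(base, alias))
```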

Pre-commit autoformatted black (Python) and prettier (TS) over the files touched on this branch. No semantic changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

softpudding and others added 9 commits April 21, 2026 13:27

chore: apply black + prettier formatting to bring CI clean

5ab70e2


softpudding merged commit d12340a into main Apr 25, 2026
4 checks passed

softpudding merged 9 commits into main from feat/compiler-default-alias
